CK Tile Group GEMM gfx1250 by aris134 · Pull Request #576 · ROCm/TransformerEngine

aris134 · 2026-05-06T17:54:50Z

Description

Extend the present CK tile grouped GEMM (F16/F8) implementation for compatibility with gfx1250. Replaces 3rdparty/aiter with 3rdparty/rocm-libraries for the gfx1250 changes from CK.

Fixes #16490

Type of change

Documentation change (change only to the documentation, either a fix or a new content)
Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Breaking change (fix or feature that would cause existing functionality to not work as expected)
Infra/Build change
Code refactoring

Changes

Please list the changes introduced in this PR:

Change A
Change B

Checklist:

I have read and followed the contributing guidelines
The functionality is complete
I have commented my code, particularly in hard-to-understand areas
I have made corresponding changes to the documentation
My changes generate no new warnings
I have added tests that prove my fix is effective or that my feature works
New and existing unit tests pass locally with my changes

ipanfilo · 2026-05-07T17:20:27Z

  // Currently only support cutlass group gemm on Hopper Arch
-  if (!(is_hopper && use_cutlass)) {
+  //if (!(is_hopper && use_cutlass)) {
+  if (!use_cutlass) {


This is the CUDA path.

Reverted in 752e0d3

ipanfilo · 2026-05-07T17:24:22Z

+  using type = TileCfg_256x256x64_WMMA;
+};
+
+template <GPUArch Arch>


Why does it need a template rather than a regular if-else or switch-case?

The template is needed because the arch selection affects CK kernel template instantiation, not just runtime control flow. GPUArch must be a compile-time value so if constexpr can prune unsupported tile/kernel combinations for a given architecture. In this case, it prevents the MFMA configs from being instantiated for gfx1250.

I didn't compile it with only the gfx1250 arch, but I am still puzzled by this templated dispatch. In line 298, you still rely on the runtime detect_gpu_arch() to branch to the specific ck_tile_grouped_gemm_fp16_dispatch_arch<arch_id> instantiations. So I presume the versions for all three arches will still be instantiated? I also didn't see any compile-time guarding.

Good point. Before the latest change the runtime switch still referenced all ck_tile_grouped_gemm_fp16_dispatch_arch<...> specializations, so the compiler could still instantiate all arch variants. b55fe29 adds compile-time #if defined(__gfxXXX__) guards around each runtime dispatch case so unsupported arch paths are no longer instantiated.
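The guarded dispatch described in this reply might look roughly like the sketch below. It assumes the three arch macros are `__gfx942__`, `__gfx950__`, and `__gfx1250__` (the thread only writes `__gfxXXX__`). The function names are simplified stand-ins for `ck_tile_grouped_gemm_fp16_dispatch_arch`, and the `false` fallback is an assumption about how an unsupported arch is reported.

```cpp
// Hypothetical arch enum; UNKNOWN models a device the build does not cover.
enum class GPUArch { GFX942, GFX950, GFX1250, UNKNOWN };

// Stand-in for the per-arch templated entry point; returns true when a
// kernel was launched. The real function instantiates CK tile kernels.
template <GPUArch Arch>
bool grouped_gemm_dispatch_arch() {
  return true;
}

// Runtime dispatch. Each case is compiled only when the matching arch macro
// is defined, so the template above is never instantiated for other arches.
bool grouped_gemm_dispatch(GPUArch arch) {
  switch (arch) {
#if defined(__gfx942__)
    case GPUArch::GFX942:
      return grouped_gemm_dispatch_arch<GPUArch::GFX942>();
#endif
#if defined(__gfx950__)
    case GPUArch::GFX950:
      return grouped_gemm_dispatch_arch<GPUArch::GFX950>();
#endif
#if defined(__gfx1250__)
    case GPUArch::GFX1250:
      return grouped_gemm_dispatch_arch<GPUArch::GFX1250>();
#endif
    default:
      // Arch not compiled into this build: report "not handled" (assumed behavior).
      return false;
  }
}
```

With none of the macros defined, every case is preprocessed away and the switch reduces to the default branch.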

wangye805 · 2026-05-21T05:06:39Z

+  static constexpr ck_tile::index_t M_Warp_Tile = 16;
+  static constexpr ck_tile::index_t N_Warp_Tile = 16;
+  static constexpr ck_tile::index_t K_Warp_Tile = 32;
+
+  static constexpr bool kPadM = true;
+  static constexpr bool kPadN = true;
+  static constexpr bool kPadK = true;


So the difference between TileCfg_256x256x64_MFMA and TileCfg_256x256x64_WMMA is in the M, N, K warp tile sizes and the kPads?

The difference is not just the warp tile shape or kPads. MFMA and WMMA are different warp-level MMA instruction paths, so they lower through different warp dispatch/pipeline configurations with different tile and padding requirements.

Oh, I was comparing the two structs TileCfg_256x256x64_MFMA and TileCfg_256x256x64_WMMA themselves. Inside those two struct definitions, do the contents differ only in the warp tile sizes and the kPads?
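For reference, the WMMA fields quoted in the hunk can be collected into a self-contained sketch. The warp tile sizes and padding flags are taken from the diff. The block tile sizes are only inferred from the `256x256x64` in the struct name, and the `_Sketch` struct, the `index_t` alias, and the two helper functions are hypothetical. The MFMA values are not shown in this thread, so they are left out.

```cpp
#include <cstdint>

// Stand-in for ck_tile::index_t so the sketch compiles without CK headers.
using index_t = std::int32_t;

struct TileCfg_256x256x64_WMMA_Sketch {
  // Block tile sizes: assumed from the struct name, not shown in the hunk.
  static constexpr index_t M_Tile = 256;
  static constexpr index_t N_Tile = 256;
  static constexpr index_t K_Tile = 64;

  // Warp tile sizes: as quoted in the diff.
  static constexpr index_t M_Warp_Tile = 16;
  static constexpr index_t N_Warp_Tile = 16;
  static constexpr index_t K_Warp_Tile = 32;

  // Padding flags: as quoted in the diff.
  static constexpr bool kPadM = true;
  static constexpr bool kPadN = true;
  static constexpr bool kPadK = true;
};

// Hypothetical sanity check: each block tile dimension must be a whole
// multiple of the corresponding warp tile dimension.
template <typename Cfg>
constexpr bool warp_tiles_divide_block_tile() {
  return Cfg::M_Tile % Cfg::M_Warp_Tile == 0 &&
         Cfg::N_Tile % Cfg::N_Warp_Tile == 0 &&
         Cfg::K_Tile % Cfg::K_Warp_Tile == 0;
}

// Hypothetical helper: number of K warp-tile steps needed to cover one block tile in K.
template <typename Cfg>
constexpr index_t k_steps_per_block_tile() {
  return Cfg::K_Tile / Cfg::K_Warp_Tile;
}

static_assert(warp_tiles_divide_block_tile<TileCfg_256x256x64_WMMA_Sketch>(),
              "warp tile must evenly divide the block tile");
```

Because any config struct with the same member names works with these helpers, a second struct for the MFMA path could be checked the same way once its values are known.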

wangye805 · 2026-05-21T05:15:49Z

  COMPILE_OPTIONS "-g0;-dopt=on")
 else()
-  set(CK_ROOT ${CMAKE_CURRENT_SOURCE_DIR}/../../3rdparty/aiter/3rdparty/composable_kernel)
+  set(CK_ROOT ${CMAKE_CURRENT_SOURCE_DIR}/../../3rdparty/rocm_libraries/projects/composablekernel)


nit: Will the whole rocm_libraries checkout be too big? Do we have a way to do a sparse checkout of just this CK subdirectory?

Good point. The full rocm_libraries checkout is fairly large (~8G locally), while projects/composablekernel alone is much smaller (~167M). Yeah, sparse checkout probably makes sense here, but I am wondering if it would be better handled in a separate PR.

Yeah, we can do it in a separate PR.

aris134 added 2 commits May 6, 2026 13:42

initial ck group gemm fp16/fp8 integration

d52075d

CK grouped GEMM: normalize FP8 dispatch to NT and add gfx1250 tile co…

2934c99

…nfig

aris134 assigned matthiasdiener and aris134 May 6, 2026

aris134 requested a review from wenchenvincent as a code owner May 6, 2026 17:54

aris134 added the ci-level 1 CI test level 1 label May 6, 2026

aris134 requested review from ipanfilo and wangye805 as code owners May 6, 2026 17:54

aris134 assigned aris134 and unassigned matthiasdiener and aris134 May 6, 2026

aris134 requested a review from matthiasdiener May 6, 2026 17:55

aris134 changed the title ~~CK Tile Group GEMM GFX1250~~ CK Tile Group GEMM gfx1250 May 6, 2026

ipanfilo requested changes May 7, 2026

revert change in CUDA path

752e0d3

aris134 requested a review from ipanfilo May 11, 2026 13:04

Add direct ROCm libraries dependency for CK grouped GEMM

9ea316d

wangye805 requested changes May 21, 2026

address pr comments

b55fe29

aris134 requested a review from wangye805 May 21, 2026 17:02

wangye805 approved these changes May 21, 2026

Merge remote-tracking branch 'origin/gfx1250' into amartin/ck-group-g…

8f2acdc

…emm-gfx1250-clean

aris134 merged commit a67bbe9 into gfx1250 May 21, 2026
3 checks passed

4 participants
